Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Second-pass rescoring is a critical component of competitive automatic speech
recognition (ASR) systems. Large language models have demonstrated their
ability to use pre-trained information to better rescore ASR
hypotheses. Discriminative training, which directly optimizes the minimum
word error rate (MWER) criterion, typically improves rescoring. In this study,
we propose and explore several discriminative fine-tuning schemes for
pre-trained LMs. We propose two architectures based on different pooling
strategies of output embeddings and compare them with probability-based MWER. We
conduct detailed comparisons between pre-trained causal and bidirectional LMs
in discriminative settings. Experiments on LibriSpeech demonstrate that all
MWER training schemes are beneficial, giving additional gains of up to 8.5% WER.
Proposed pooling variants achieve lower latency while retaining most
improvements. Finally, our study concludes that bidirectionality is better
utilized with discriminative training.
Comment: ASRU 202
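The MWER criterion referred to above can be sketched in a few lines. The following is an illustrative Python implementation (not any paper's actual code), using a common variance-reduced form that weights each hypothesis's word errors, relative to the N-best average, by its softmax posterior:

```python
import math

def mwer_loss(scores, word_errors):
    """Expected word-error objective over an N-best list (illustrative sketch).

    scores: model scores for each hypothesis (higher = more likely).
    word_errors: edit distance of each hypothesis to the reference transcript.
    """
    # Softmax over hypothesis scores gives a posterior per hypothesis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    posteriors = [e / total for e in exps]
    # Subtracting the mean error is a standard variance-reduction trick:
    # the loss is negative when probability mass sits on low-error hypotheses.
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, word_errors))
```

Minimizing this quantity pushes probability mass toward hypotheses with fewer word errors, which is what "directly optimizing the MWER criterion" refers to.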
Scaling Laws for Discriminative Speech Recognition Rescoring Models
Recent studies have found that model performance has a smooth power-law
relationship, or scaling laws, with training data and model size, for a wide
range of problems. These scaling laws allow one to choose nearly optimal data
and model sizes. We study whether this scaling property is also applicable to
second-pass rescoring, which is an important component of speech recognition
systems. We focus on RescoreBERT as the rescoring model, which uses a
pre-trained Transformer-based architecture fine-tuned with an ASR
discriminative loss. Using such a rescoring model, we show that the word error
rate (WER) follows a scaling law for over two orders of magnitude as training
data and model size increase. In addition, we find that a pre-trained model
requires less data than a randomly initialized model of the same size to reach
the same WER, a difference representing effective data transferred from the
pre-training step. This effective data transfer is also found to follow a
scaling law with data and model size.
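A power-law relationship of this kind, WER ≈ a·D^(−b), can be recovered by a linear fit in log-log space. The sketch below is illustrative, not the paper's fitting procedure:

```python
import math

def fit_power_law(sizes, wers):
    """Fit wer ~ a * size**(-b) by least squares in log-log space.

    Illustrative sketch: taking logs turns the power law into a line,
    log(wer) = log(a) - b * log(size), fit by ordinary least squares.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(w) for w in wers]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope                      # power-law exponent
    a = math.exp(my - slope * mx)   # prefactor
    return a, b
```

Given such a fit for several model sizes, one can extrapolate the data budget needed to reach a target WER, which is how scaling laws guide nearly optimal data and model size choices.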
Personalization for BERT-based Discriminative Speech Recognition Rescoring
Recognition of personalized content remains a challenge in end-to-end speech
recognition. We explore three novel approaches that use personalized content in
a neural rescoring step to improve recognition: gazetteers, prompting, and a
cross-attention based encoder-decoder model. We use internal de-identified
en-US data from interactions with a virtual voice assistant supplemented with
personalized named entities to compare these approaches. On a test set with
personalized named entities, we show that each of these approaches improves
word error rate by over 10% relative to a neural rescoring baseline. We also show
that on this test set, natural language prompts can improve word error rate by
7% without any training and with a marginal loss in generalization. Overall,
gazetteers were found to perform the best with a 10% improvement in word error
rate (WER), while also improving WER on a general test set by 1%.
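Of the three approaches, gazetteer biasing is the simplest to sketch: boost the rescoring score of any hypothesis that mentions a personalized entity. The function and bonus value below are hypothetical illustrations, not the paper's implementation:

```python
def gazetteer_rescore(hypotheses, scores, gazetteer, bonus=1.0):
    """Add a score bonus per personalized entity matched in each hypothesis.

    hypotheses: candidate transcripts from the first pass.
    scores: their rescoring-model scores (higher = better).
    gazetteer: set of personalized entity strings (e.g. contact names).
    Hypothetical sketch of gazetteer biasing in a neural rescoring step.
    """
    rescored = []
    for hyp, score in zip(hypotheses, scores):
        matches = sum(1 for entity in gazetteer if entity in hyp)
        rescored.append(score + bonus * matches)
    return rescored
```

With a contact list containing "john smith", a hypothesis "call john smith" would be boosted past an otherwise higher-scoring "call jon smith", recovering the personalized entity.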
Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
We explore the ability of large language models (LLMs) to act as speech
recognition post-processors that perform rescoring and error correction. Our
first focus is on instruction prompting to let LLMs perform these tasks without
fine-tuning, for which we evaluate different prompting schemes, both zero- and
few-shot in-context learning, and a novel task-activating prompting method that
combines causal instructions and demonstrations to enrich the context window.
Next, we show that rescoring only by in-context learning with frozen LLMs
achieves results that are competitive with rescoring by domain-tuned LMs, using
a pretrained first-pass recognition system and rescoring output on two
out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with
fine-tuning we achieve error rates below the N-best oracle level, showcasing
the generalization power of the LLMs.
Comment: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages.
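A zero- or few-shot prompt for this kind of error correction can be assembled mechanically from in-context examples and the first-pass N-best list. The instruction wording below is invented for illustration; the actual prompts used in the paper may differ:

```python
def build_correction_prompt(examples, nbest):
    """Assemble a few-shot prompt asking an LLM to correct an ASR N-best list.

    examples: list of (candidate_transcripts, reference) demonstration pairs.
    nbest: candidate transcripts for the utterance to be corrected.
    The instruction and field labels are hypothetical, for illustration only.
    """
    lines = ["Correct the speech recognition output given the candidate transcripts."]
    for candidates, reference in examples:
        lines.append("Candidates: " + " | ".join(candidates))
        lines.append("Corrected: " + reference)
    # The final block is left open for the LLM to complete.
    lines.append("Candidates: " + " | ".join(nbest))
    lines.append("Corrected:")
    return "\n".join(lines)
```

Zero-shot corresponds to an empty `examples` list; few-shot in-context learning adds demonstrations, and the frozen LLM's completion serves as the corrected transcript.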
PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers
End-to-End (E2E) automatic speech recognition (ASR) systems used in voice
assistants often have difficulties recognizing infrequent words personalized to
the user, such as names and places. Rare words often have non-trivial
pronunciations, and in such cases, human knowledge in the form of a
pronunciation lexicon can be useful. We propose a PROnunciation-aware
ConTextual adaptER (PROCTER) that dynamically injects lexicon knowledge into an
RNN-T model by adding a phonemic embedding along with a textual embedding. The
experimental results show that the proposed PROCTER architecture outperforms
the baseline RNN-T model by improving the word error rate (WER) by 44% and 57%
when measured on personalized entities and personalized rare entities,
respectively, while increasing the model size (number of trainable parameters)
by only 1%. Furthermore, when evaluated in a zero-shot setting to recognize
personalized device names, we observe 7% WER improvement with PROCTER, as
compared to only 1% WER improvement with text-only contextual attention.
Comment: To appear in Proc. IEEE ICASS
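The core idea, injecting a phonemic embedding alongside a textual embedding for each contextual entity, can be sketched as a simple fusion step. A weighted elementwise sum is a deliberate simplification; PROCTER's actual adapter architecture inside the RNN-T differs:

```python
def fuse_context_embedding(text_emb, phone_emb, w_text=0.5, w_phone=0.5):
    """Combine a textual and a phonemic embedding for one contextual entity.

    text_emb: embedding of the entity's written form.
    phone_emb: embedding of its lexicon pronunciation.
    The weighted sum and the 0.5/0.5 weights are illustrative assumptions.
    """
    assert len(text_emb) == len(phone_emb), "embeddings must share a dimension"
    return [w_text * t + w_phone * p for t, p in zip(text_emb, phone_emb)]
```

The fused vector is what the contextual adapter would attend over, letting a rare name with a non-trivial pronunciation match the audio through its phonemic representation rather than its spelling alone.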
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
We propose a neural language modeling system based on low-rank adaptation
(LoRA) for speech recognition output rescoring. Although pretrained language
models (LMs) like BERT have shown superior performance in second-pass
rescoring, the high computational cost of scaling up the pretraining stage and
of adapting the pretrained models to specific domains limits their practical use in
rescoring. Here we present a method based on low-rank decomposition to train a
rescoring BERT model and adapt it to new domains using only a fraction (0.08%)
of the pretrained parameters. The inserted low-rank matrices are optimized through a
discriminative training objective along with a correlation-based regularization
loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is
evaluated on LibriSpeech and internal datasets, reducing training time by
factors of 3.6 to 5.4.
Comment: Accepted to IEEE ASRU 2023. 8 pages.
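Low-rank adaptation leaves the pretrained weight W frozen and trains only two small factors B and A, so the effective weight becomes W + α·BA. A dependency-free sketch (dimensions, argument names, and α are illustrative):

```python
def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha=1.0):
    """Effective weight W + alpha * (B @ A) under low-rank adaptation.

    W: frozen pretrained weight, shape (d_out, d_in).
    A: trainable factor, shape (r, d_in); B: trainable factor, shape (d_out, r),
    with rank r much smaller than d_in. Illustrative sketch only.
    """
    delta = matmul(B, A)
    return [[w + alpha * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]
```

For a d-by-d weight adapted at rank r, the trainable parameter count is 2·d·r versus d², a fraction of 2r/d, which is how budgets as small as 0.08% of the pretrained parameters arise.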